
feat: add HttpAgent, per-step evaluation, and lightweight trace export #118

Merged: abrichr merged 1 commit into main from feat/platform-infra on Mar 16, 2026

abrichr commented Mar 16, 2026

Summary

Three platform infrastructure features for generalizable agent integration:

  • HttpAgent (agents/http_agent.py): Generic agent-as-HTTP-service. Any team can deploy its agent stack as an HTTP server: the orchestrator sends {screenshot, instruction, viewport} to POST /act and gets back a BenchmarkAction. This cleanly solves the GPU/CPU separation problem (models need GPUs, while WAA VMs need nested-virtualization CPU instances). Includes a health check, graceful error handling, and an optional /reset notification.

  • Per-step evaluation in RLEnvironment: A new evaluate_every_step=True parameter calls the WAA evaluator after each step and populates info["evaluation_score"]. The reward signal is NOT changed (it stays 0.0 mid-episode); training code decides how to use the per-step evaluation data. Evaluation errors are caught gracefully.

  • LightweightTraceExporter: Plain JSON + screenshots trace export with no openadapt-ml dependency. Produces episode JSON files, a manifest, and JSONL training samples in a universal format that any training pipeline can consume.
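
To make the HttpAgent request/response flow concrete, here is a minimal sketch of the client side using only the standard library. It is not the code in agents/http_agent.py: the `Action` dataclass is a hypothetical stand-in for BenchmarkAction, and the base64 screenshot encoding, the viewport shape, the response field names, and the "done" fallback on errors are all assumptions made for illustration.

```python
import base64
import json
from dataclasses import dataclass
from typing import Optional, Tuple
from urllib import request


@dataclass
class Action:
    """Hypothetical stand-in for BenchmarkAction; the real fields may differ."""
    type: str
    x: Optional[float] = None
    y: Optional[float] = None
    text: Optional[str] = None


def build_act_payload(screenshot_png: bytes, instruction: str,
                      viewport: Tuple[int, int]) -> dict:
    """Build the JSON body for POST /act (encoding and key layout are assumed)."""
    return {
        "screenshot": base64.b64encode(screenshot_png).decode("ascii"),
        "instruction": instruction,
        "viewport": {"width": viewport[0], "height": viewport[1]},
    }


def parse_action(body: object) -> Action:
    """Turn a decoded JSON response into an Action.

    A response without a "type" field falls back to a "done" action
    (an assumed policy, chosen so a bad response ends the episode).
    """
    if not isinstance(body, dict) or "type" not in body:
        return Action(type="done")
    return Action(type=str(body["type"]), x=body.get("x"),
                  y=body.get("y"), text=body.get("text"))


def act(base_url: str, payload: dict, timeout: float = 30.0) -> Action:
    """POST the payload to <base_url>/act and parse the reply."""
    req = request.Request(
        base_url.rstrip("/") + "/act",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with request.urlopen(req, timeout=timeout) as resp:
            return parse_action(json.loads(resp.read().decode("utf-8")))
    except (OSError, ValueError):
        # URL/socket errors are OSError subclasses; invalid JSON raises ValueError.
        return Action(type="done")
```

Keeping payload construction and response parsing as pure functions means both can be unit-tested without a running server, which is one plausible way to structure tests such as the action-parsing ones listed in the test plan.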

Test plan

  • 21 tests for HttpAgent (act, reset, health_check, error handling, action parsing)
  • 5 tests for per-step evaluation (enabled/disabled, reward unchanged, error handling)
  • 8 tests for lightweight trace export (schema, filtering, coordinate normalization, JSONL)
  • All 34 new tests pass
  • All 984 existing tests unaffected
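
The trace-export tests mention coordinate normalization and JSONL output. As a rough illustration of what those two behaviours could look like, the sketch below writes one JSON object per step and scales pixel coordinates into the 0..1 range. The function names, the step dictionary layout, and the sample schema are hypothetical and are not taken from LightweightTraceExporter.

```python
import json
from typing import Iterable, Tuple


def normalize_point(x_px: float, y_px: float,
                    width: int, height: int) -> Tuple[float, float]:
    """Scale pixel coordinates to fractions of the viewport size."""
    return (x_px / width, y_px / height)


def write_samples_jsonl(steps: Iterable[dict], viewport: Tuple[int, int],
                        out_path: str) -> int:
    """Write one training sample per line and return the number written.

    Each step is assumed to look like:
      {"instruction": str, "screenshot": str (file path), "action": dict}
    """
    width, height = viewport
    count = 0
    with open(out_path, "w", encoding="utf-8") as f:
        for i, step in enumerate(steps):
            action = dict(step["action"])  # copy so the caller's data is untouched
            if "x" in action and "y" in action:
                action["x"], action["y"] = normalize_point(
                    action["x"], action["y"], width, height)
            sample = {
                "step": i,
                "instruction": step["instruction"],
                "screenshot": step["screenshot"],
                "action": action,
            }
            f.write(json.dumps(sample) + "\n")
            count += 1
    return count
```

Normalized coordinates keep samples comparable across viewport sizes, and JSONL lets a training pipeline stream samples line by line with nothing beyond a JSON parser.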

🤖 Generated with Claude Code

Three platform infrastructure features:

1. HttpAgent (agents/http_agent.py): Generic agent-as-HTTP-service that
   forwards observations to any remote endpoint and parses BenchmarkAction
   responses. Enables teams to deploy custom agent stacks (model + prompt +
   parsing) as black-box HTTP servers, cleanly solving GPU/CPU separation.

2. Per-step evaluation in RLEnvironment: New evaluate_every_step parameter
   calls the WAA evaluator after each step and populates
   info["evaluation_score"]. Does NOT change the reward signal — training
   code decides how to use it. Useful for online RL training loops.

3. LightweightTraceExporter: Plain JSON + screenshots trace export with no
   openadapt-ml dependency. Produces episode JSON, manifest, and JSONL
   training samples in a universal format.
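
Item 2 leaves the use of info["evaluation_score"] to training code. One possible pattern, sketched here under assumptions, is to record the per-step scores and derive score deltas as an optional shaping signal. `StubEnv` is a hypothetical stand-in for an RLEnvironment created with evaluate_every_step=True; its `step()` return tuple, the terminal reward, and the handling of a missing score are illustrative guesses, not the real API.

```python
from typing import Callable, List, Optional, Tuple


class StubEnv:
    """Hypothetical stand-in for RLEnvironment(evaluate_every_step=True)."""

    def __init__(self, scores: List[float]):
        self._scores = list(scores)
        self._t = 0

    def reset(self) -> dict:
        self._t = 0
        return {"screenshot": None}

    def step(self, action: str) -> Tuple[dict, float, bool, dict]:
        score = self._scores[self._t]
        self._t += 1
        done = self._t >= len(self._scores)
        # Reward stays 0.0 mid-episode; returning the final score as the
        # terminal reward is an assumption of this stub.
        reward = score if done else 0.0
        return {"screenshot": None}, reward, done, {"evaluation_score": score}


def run_episode(env: StubEnv, policy: Callable[[], str]
                ) -> Tuple[List[float], List[Optional[float]]]:
    """Roll out one episode, collecting rewards and per-step scores."""
    env.reset()
    rewards: List[float] = []
    scores: List[Optional[float]] = []
    done = False
    while not done:
        _, reward, done, info = env.step(policy())
        rewards.append(reward)
        # .get() tolerates a missing key, e.g. if evaluation failed (assumed).
        scores.append(info.get("evaluation_score"))
    return rewards, scores


def score_deltas(scores: List[Optional[float]]) -> List[float]:
    """Per-step change in score; steps without a score contribute 0.0."""
    prev, deltas = 0.0, []
    for s in scores:
        if s is None:
            deltas.append(0.0)
            continue
        deltas.append(s - prev)
        prev = s
    return deltas
```

For example, `run_episode(StubEnv([0.0, 0.5, 1.0]), lambda: "noop")` yields rewards that are zero until the last step, while the score list exposes intermediate progress that a trainer may use or ignore.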

All 34 new tests pass. 984 existing tests unaffected.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
abrichr force-pushed the feat/platform-infra branch from 95ebea5 to c12e097 on March 16, 2026, 21:19
abrichr merged commit e820c0a into main on Mar 16, 2026
1 check passed
